Incremental ETL Pipeline Scheduling for Near Real-Time Data Warehouses
نویسندگان
چکیده
We present our work based on an incremental ETL pipeline for on-demand data warehouse maintenance. Pipeline parallelism is exploited to concurrently execute a chain of maintenance jobs, each of which takes a batch of delta tuples extracted from source-local transactions with commit timestamps preceding the arrival time of an incoming warehouse query and calculates Ąnal deltas to bring relevant warehouse tables up-to-date. Each pipeline operator runs in a single, non-terminating thread to process one job at a time and re-initializes itself for a new one. However, to continuously perform incremental joins or maintain slowly changing dimension tables (SCD), the same staging tables or dimension tables can be concurrently accessed and updated by distinct pipeline operators which work on diferent jobs. Inconsistencies can arise without proper thread coordinations. In this paper, we proposed two types of consistency zones for SCD and incremental join to address this problem. Besides, we reviewed existing pipeline scheduling algorithms in our incremental ETL pipeline with consistency zones.
منابع مشابه
Real-Time Snapshot Maintenance with Incremental ETL Pipelines in Data Warehouses
Multi-version concurrency control method has nowadays been widely used in data warehouses to provide OLAP queries and ETL maintenance flows with concurrent access. A snapshot is taken on existing warehouse tables to answer a certain query independently of concurrent updates. In this work, we extend this snapshot with the deltas which reside at the source side of ETL flows. Before answering a qu...
متن کاملNear-real-time Parallel Etl+q for Automatic Scalability in Bigdata
In this paper we investigate the problem of providing scalability to near-real-time ETL+Q (Extract, transform, load and querying) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically during small fixed time windows. We propose an approach to enable the automatic scalability and freshness of any data warehouse a...
متن کاملNext-generation ETL Framework to Address the Challenges Posed by Big Data
The specific features of Big Data i.e., variety, volume, and velocity call for special measures to create ETL data pipelines and data warehouses. A rapidly growing need for analyzing Big Data calls for novel architectures for warehousing the data, such as data lakes or polystores. In both of the architectures, ETL processes serve similar purposes as in traditional data warehouse architectures. ...
متن کاملEfficient ETL+Q for Automatic Scalability in Big or Small Data Scenarios
In this paper, we investigate the problem of providing scalability to data Extraction, Transformation, Load and Querying (ETL+Q) process of data warehouses. In general, data loading, transformation and integration are heavy tasks that are performed only periodically. Parallel architectures and mechanisms are able to optimize the ETL process by speedingup each part of the pipeline process as mor...
متن کاملStriving towards Near Real-Time Data Integration for Data Warehouses
The amount of information available to large-scale enterprises is growing rapidly. While operational systems are designed to meet well-specified (short) response time requirements, the focus of data warehouses is generally the strategic analysis of business data integrated from heterogeneous source systems. The decision making process in traditional data warehouse environments is often delayed ...
متن کامل